NSF PAR Search | NSF Public Access Repository


Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

https://doi.org/10.1371/journal.pcbi.1009449

Sarmashghi, Shahab; Balaban, Metin; Rachtman, Eleonora; Touri, Behrouz; Mirarab, Siavash; Bafna, Vineet (November 2021, PLOS Computational Biology)
Segata, Nicola (Ed.)
The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git .
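As a point of reference for the term "k-mer repeat spectrum", the sketch below counts, for a fully known sequence, how many distinct k-mers occur exactly j times. It is only an illustration of the quantity being estimated, not the RESPECT method: it assumes an assembled sequence rather than low-coverage reads, ignores reverse complements, and the function names are hypothetical.

```python
from collections import Counter


def kmer_repeat_spectrum(seq, k):
    """Map multiplicity j to the number of distinct k-mers seen exactly j times.

    Assumption: `seq` is a single linear sequence and k-mers are not
    canonicalized (forward strand only).
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    spectrum = Counter(counts.values())
    return dict(sorted(spectrum.items()))


def length_in_kmers(spectrum):
    """Total number of k-mer positions implied by a spectrum: sum of j * r_j."""
    return sum(j * r for j, r in spectrum.items())


spectrum = kmer_repeat_spectrum("ACGTACGTAA", 3)
total = length_in_kmers(spectrum)
```

For a linear sequence of length L, the sum of j * r_j equals L - k + 1, which is why an accurate spectrum also determines the genome length.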
Full Text Available
Computing the Statistical Significance of Overlap between Genome Annotations with iStat

https://doi.org/10.1016/j.cels.2019.05.006

Sarmashghi, Shahab; Bafna, Vineet (June 2019, Cell Systems)

Full Text Available
APPLES: Fast Distance-Based Phylogenetic Placement.

Balaban, Metin; Sarmashghi, Shahab; Mirarab, Siavash (May 2019, Lecture Notes in Computer Science)

Full Text Available
Skmer: assembly-free and alignment-free sample identification using genome skims

https://doi.org/10.1186/s13059-019-1632-4

Sarmashghi, Shahab; Bohmann, Kristine; Gilbert, M. Thomas P.; Bafna, Vineet; Mirarab, Siavash (December 2019, Genome Biology)

Full Text Available
APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments

https://doi.org/10.1093/sysbio/syz063

Balaban, Metin; Sarmashghi, Shahab; Mirarab, Siavash (September 2019, Systematic Biology)
Posada, David (Ed.)

Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to benefit from the dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially, so that APPLES on dense trees is more accurate than ML on the sparser trees where ML can run. Finally, APPLES can accurately identify samples without assembled references or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is publicly available at github.com/balabanmetin/apples.
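To make "k-mer-based distances" concrete, here is a minimal sketch of one well-known alignment-free estimate, the Mash-style distance computed from the Jaccard index of two k-mer sets. The abstract does not specify which distance APPLES consumes, so this is an assumption chosen for illustration; the function names are hypothetical, exact k-mer sets are used instead of sketches, and returning 1.0 when no k-mers are shared is a convention adopted here.

```python
import math


def kmer_set(seq, k):
    """Set of distinct k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_style_distance(j, k):
    """Mash-style distance D = -(1/k) * ln(2J / (1 + J)).

    Assumption: capped at 1.0 when the sets share nothing (J == 0).
    """
    if j <= 0.0:
        return 1.0
    return -math.log(2.0 * j / (1.0 + j)) / k


a = kmer_set("ACGTACGTTGCA", 4)
b = kmer_set("ACGTACGATGCA", 4)
d = mash_style_distance(jaccard(a, b), 4)
```

A matrix of such pairwise distances between a query and the reference taxa is the kind of input a distance-based placement method can work from without alignments.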
